Main Approaches
- Over-representation analysis (ORA)
- Gene Set Enrichment Analysis (GSEA) … Functional Class Scoring(FCS)
- Pathway Topology-based methods
Advanced R-course 2025
Bioinformatics Core Facility CECAD
2025-11-21
git clone https://github.com/CECADBioinformaticsCoreFacility/Advanced_R_course_2025.git
https://cecadbioinformaticscorefacility.github.io/Advanced_R_course_2025/
Session 5 :: Enrichment Analysis
What to do with the DE genes?
Are these genes associated with particular biological functions, pathways, or processes more than we’d expect by chance?
Advantages:
Limitations:
Regrouping genes together in meaningful sets based on:
Note
To check if any categories are over-represented, compare the proportion of genes in each category in your gene list to the proportion in the background set and calculate how likely it is to see your observed proportion by chance.
Image source: https://www.nature.com/articles/nrc2036
Human curated:
Domains / Patterns:
Experimental:
Molecular Signatures Database (MSigDB) is a popular resource that compiles many gene sets from various sources.
GO provides a controlled vocabulary to describe gene functions.
Three main domains:
GO terms form a hierarchical structure.
Each GO term may have:
Relationships include:
Hierarchy is not strictly a tree; terms can have multiple parents.
Image source: https://www.nature.com/articles/nrc2036
Because children inherit from parents:
Use tools that:
Focus on most specific significant terms for clearer insights.
packages like topGO, ClueGO, GO-Bayes, GO-Elit etc. help address hierarchy issues.
| Categories | Org Background | DE results | Overrepresented ? |
|---|---|---|---|
| Category 1 | 59/20000 | 45/2000 | Yes/No |
| Category 2 | 150/20000 | 30/2000 | Yes/No |
| Category 3 | 300/20000 | 10/2000 | Yes/No |
| … | … | … | … |
| Symbol | Description |
|---|---|
| \(N\) | Total number of genes (universe) |
| \(k\) | Number of DE genes (your list) |
| \(M\) | Number of genes in a category(GO term) |
| \(n\) | Number of DE genes found in that category |
We test if \(n\) is greater than expected by chance.
The hypergeometric test calculates the probability of observing \(k\) or more DE genes in a pathway of size \(M\), given the total number of genes \(N\) and the number of DE genes \(n\).
The p-value is calculated as: \[ P(X = x) = \frac{\binom{k}{x} \, \binom{N-k}{M-x}}{\binom{N}{M}} \]
Image source: https://metascape.org/blog/?p=122
# Define parameters
N <- 20000 # Total number of genes
k <- 150 # Genes in the GO term
M <- 2000 # Total number of DE genes
n <- 30 # DE genes in the GO term
# Calculate p-value using hypergeometric test
p_value <- phyper(q = n - 1, # q = x-1 for upper tail
m = k, # number of white balls in population
n = N - k, # number of black balls in population
k = M, # number of draws
lower.tail = FALSE)
print(p_value)[1] 0.0001681513
This p-value indicates the probability of observing 30 or more DE genes in the GO term by chance. A low p-value (e.g., < 0.05) suggests that the GO term is significantly over-represented among the DE genes.
Note
Remember to adjust for multiple testing when evaluating p-values across many GO terms.
More on Multiple testing
clusterProfilertopGOGOstatsgprofiler2enrichRReactomePAfgsea (for GSEA)DOSE (for Disease Ontology)pathfindRWebGestaltRclusterProfiler# Load necessary libraries
library(clusterProfiler)
library(org.Mm.eg.db)
library(ggplot2)
library(dplyr)
library(DESeq2)
library(qs)
set.seed(420)
dds <- qread("dds_all_genotype.qs")
uni_genes <- results(dds) |> as.data.frame() |>
filter(!is.na(padj)) |>
rownames()
de_genes_up <- results(dds) |> as.data.frame() |>
filter(!is.na(padj),padj <= 0.05, log2FoldChange > 0)|>
rownames()
de_genes_dn <- results(dds) |> as.data.frame() |>
filter(!is.na(padj),padj <= 0.05, log2FoldChange < 0)|>
rownames()
ego_result_up <- enrichGO(gene = de_genes_up,
universe = uni_genes,
OrgDb = org.Mm.eg.db,
keyType = "SYMBOL",
ont = "BP",
pAdjustMethod = "BH",
pvalueCutoff = 0.05,
qvalueCutoff = 0.2,
readable = TRUE)
# View results
#DT::datatable(head(ego_result_up))
# Visualize results
barplot(ego_result_up) +
ggtitle("GO Enrichment Analysis - Upregulated Genes : Barplot")# Add similarity matrix to the termsim slot of enrichment result
ego_result_up <- enrichplot::pairwise_termsim(ego_result_up)
# Enrichmap clusters the 30 most significant (by padj) GO terms
# to visualize relationships between terms
emapplot(ego_result_up, showCategory = 30) +
ggtitle("GO Enrichment Analysis - Upregulated Genes : Enrichment Map")Directional Gene Lists
One search :
Two searches: